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Abstract. Natural signals and images are well-known to be approximately sparse in transform domains such 
as Wavelets and DCT. This property has been heavily exploited in various applications in image 
processing and medical imaging. Compressed sensing exploits the sparsity of images or image 
patches in a transform domain or synthesis dictionary to reconstruct images from undersampled 
measurements. In this work, we focus on blind compressed sensing, where the underlying sparsifying 
transform is a priori unknown, and propose a framework to simultaneously reconstruct the underlying 
image as well as the sparsifying transform from highly undersampled measurements. The proposed 
block coordinate descent type algorithms involve highly efficient optimal updates. Importantly, we 
prove that although the proposed blind compressed sensing formulations are highly nonconvex, our 
algorithms are globally convergent (i.e., they converge from any initialization) to the set of critical 
points of the objectives defining the formulations. These critical points are guaranteed to be at 
least partial global and partial local minimizers. The exact point (s) of convergence may depend 
on initialization. We illustrate the usefulness of the proposed framework for magnetic resonance 
image reconstruction from highly undersampled k-space measurements. As compared to previous 
methods involving the synthesis dictionary model, onr approach is mnch faster, while also providing 
promising reconstruction quality. 

Key words. Sparsifying transforms, Inverse problems, Compressed sensing. Medical imaging, Magnetic reso¬ 
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1. Introduction. Sparsity-based techniques have become extremely popular in various ap¬ 
plications in image processing and imaging in recent years. These techniques typically exploit 
the sparsity of images or image patches in a transform domain or dictionary to compress [47], 
denoise, or restore/reconstruct images. In this work, we investigate the subject of blind com¬ 
pressed sensing, which aims to reconstruct images in the scenario when a good sparse model 
for the image is unknown a priori. In particular, we will work with the classical sparsifying 
transform model [70] that has been shown to be useful in various imaging applications. In 
the following, we briefly review the topics of sparse modeling, compressed sensing, and blind 
compressed sensing. We then list the contributions of this work. 

1.1. Sparse Models. Sparsity-based techniques in imaging and image processing typically 
rely on the existence of a good sparse model for the data. Various models are known to 
provide sparse representations of data such as the synthesis dictionary model [24, 7], and the 
sparsifying transform model [70, 56]. 

The synthesis dictionary model suggests that a signal y G C” can be sparsely represented 
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as a linear combination of a small number of atoms or columns from a synthesis dictionary 
D G i.e., y = Dx with x G being sparse, i.e., ||x||q <C K. The /q quasi norm 

measures the sparsity of x by counting the number of non-zeros in x. Real-world signals 
usually satisfy y = Dx -t- e, where e is the approximation error in the signal domain [7]. 
The well-known synthesis sparse coding problem finds x that minimizes \\y — -Da :||2 subject 
to ||x||q < s, where s is a given sparsity level. A disadvantage of the synthesis model is 
that sparse coding is typically NP-hard (Non-deterministic Polynomial-time hard) [50, 17], 
and the various approximate sparse coding algorithms [53, 46, 16, 12, 32, 22] tend to be 
computationally expensive for large scale problems. 

The alternative sparsifying transform model suggests that a signal y is approximately 
sparsifiable using a transform W G The assumption here is that Wy = x + y, where 

X G C"* is sparse in some sense, and y is a small residual error in the transform domain 
rather than in the signal domain. Natural signals and images are known to be approximately 
sparsihable by analytical transforms such as the discrete cosine transform (DCT), Wavelets 
[45], finite differences, and Contourlets [18]. The advantage of the transform model over the 
synthesis dictionary model ^ is that sparse coding can be performed exactly and cheaply by 
zeroing out all but a certain number of nonzero transform coefficients of largest magnitude 
[70]. 

1.2. Compressed Sensing. In the context of imaging, the recent theory of Compressed 
Sensing (CS) [8, 19, 9] (see also [28, 5, 27, 79, 6, 29, 84, 4] for the earliest versions of CS 
for Fourier-sparse signals and for Fourier imaging) enables accurate recovery of images from 
significantly fewer measurements than the number of unknowns. In order to do so, it requires 
that the underlying image be sufficiently sparse in some transform domain or dictionary, and 
that the measurement acquisition procedure be incoherent, in an appropriate sense, with the 
transform. However, the image reconstruction procedure for compressed sensing is typically 
computationally expensive and non-linear. 

The image reconstruction problem in compressed sensing is typically formulated as follows 

(1.1) min ||Tx||n s.t. Ax = y 

Here, x G is a vectorized representation of the image (obtained by stacking the image 
columns on top of each other) to be reconstructed, and y G denotes the measurements. 
The operator A G with m <C p is known as the sensing matrix, or measurement matrix. 

The matrix T G is a sparsifying transform (typically chosen as orthonormal). The aim 
of Problem (1.1) is to find the image satisfying the measurement equation Ax = y, that is the 
sparsest possible in the \k-transform domain. Since, in CS, the measurement equation Ax = y 
represents an underdetermined system of equations, an additional model (such as the sparsity 
model above) is needed to estimate the true underlying image. 

When T is orthonormal, Problem (1.1) can be rewritten as 

(1.2) min ||z||n s.t. = y 

^The sparsifying transform model enjoys similar advantages over other models such as the noisy signal 
analysis dictionary model [70] , which is not discussed here for reasons of space. 
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where we used the substitution 'i’x = z, and (•)^ denotes the matrix Hermitian (conjugate 
transpose) operation. Similar to the synthesis sparse coding problem, Problem (1.2) too is 
NP-hard. Often the Iq quasi norm in (1.1) is replaced with its convex relaxation, the h 
norm [20], and the following convex problem is solved to reconstruct the image, when the CS 
measurements are noisy [38, 21]. 

(1.3) min ll^x - yllo + A ll'hx]], 

X 

In Problem (1.3), the £2 penalty for the measurement fidelity term can also be replaced 
with alternative penalties such as a weighted £2 penalty, depending on the physics of the 
measurement process and the statistics of the measurement noise. 

Recently, CS theory has been applied to imaging techniques such as magnetic resonance 
imaging (MRI) [38, 39, 10, 76, 34, 58, 62], computed tomography (CT) [11, 13, 35], and 
Positron emission tomography (PET) imaging [78, 43], demonstrating high quality recon¬ 
structions from a reduced set of measurements. Such compressive measurements are highly 
advantageous in these applications. For example, they help reduce the radiation dosage in 
CT, and reduce scan times and improve clinical throughput in MRI. Well-known inverse prob¬ 
lems in image processing such as inpainting (where an image is reconstructed from a subset 
of measured pixels) can also be viewed as compressed sensing problems. 

1.3. Blind Compressed Sensing. While conventional compressed sensing techniques uti¬ 
lize hxed analytical sparsifying transforms such as wavelets [45], hnite differences, and con- 
tourlets [18], to reconstruct images, in this work, we instead focus on the idea of blind com¬ 
pressed sensing (BCS) [64, 65, 30, 36, 80], where the underlying sparse model is assumed 
unknown a priori. The goal of blind compressed sensing is to simultaneously reconstruct the 
underlying image(s) as well as the dictionary or transform from highly undersampled mea¬ 
surements. Thus, BCS enables the sparse model to be adaptive to the specihc data under 
consideration. Recent research has shown that such data-driven adaptation of dictionaries 
or transforms is advantageous in many applications [23, 41, 1, 42, 85, 64, 69, 66, 73, 81]. 
While the adaptation of synthesis dictionaries [52, 26, 2, 83, 75, 40] has been extensively 
studied, recent work has shown advantages in terms of computation and application-specihc 
performance, for the adaptation of transform models [70, 73, 81]. 

In a prior work on BCS [64], we successfully demonstrated the usefulness of dictionary- 
based blind compressed sensing for MRI, even in the case when the undersampled measure¬ 
ments corresponding to only a single image are provided. In the latter case, the overlapping 
patches of the underlying image are assumed to be sparse in a dictionary, and the (unknown) 
patch-based dictionary, that is typically much smaller in size than the image, is learnt directly 
from the compressive measurements. 

BCS techniques have been demonstrated to provide much better image reconstruction 
quality compared to compressed sensing methods that utilize a fixed sparsifying transform 
or dictionary [64, 36, 80]. This is not surprising since BCS methods allow for data-specific 
adaptation, and data-specihc dictionaries typically sparsify the underlying images much better 
than analytical ones. 
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1.4. Contributions. 

1.4.1. Highlights. The BCS framework assumes a particular class of sparse models for 
the underlying image(s) or image patches. While prior work on BCS primarily focused on the 
synthesis dictionary model, in this work, we instead focus on the sparsifying transform model. 
We propose novel problem formulations for BCS involving well-conditioned or orthonormal 
adaptive square sparsifying transforms. Our framework simultaneously adapts the sparsifying 
transform and reconstructs the underlying image(s) from highly undersampled measurements. 
We propose efficient block coordinate descent-type algorithms for transform-based BCS. Im¬ 
portantly, we establish that our iterative algorithms are globally convergent (i.e., they converge 
from any initialization) to the set of critical points of the proposed highly non-convex BCS 
cost functions. These critical points are guaranteed to be at least partial global and par¬ 
tial local minimizers. The exact point(s) of convergence may depend on initialization. Such 
convergence guarantees have not been established for prior blind compressed sensing methods. 

Note that although we focus on compressed sensing in the discussions and experiments 
of this work, the formulations and algorithms proposed by us can also handle the case when 
the measurement or sensing matrix A is square (e.g., in signal denoising), or even tall (e.g., 
deconvolution). 

1.4.2. Magnetic Resonance Imaging Application. MRI is a non-invasive and non¬ 
ionizing imaging technique that offers a variety of contrast mechanisms, and enables excellent 
visualization of anatomical structures and physiological functions. However, the data in MRI, 
which are samples in k-space of the spatial Fourier transform of the object, are acquired se¬ 
quentially in time. Hence, a major drawback of MRI, that affects both clinical throughput 
and image quality especially in dynamic imaging applications, is that it is a relatively slow 
imaging technique. Although there have been advances in scanner hardware [57] and pulse 
sequences, the rate at which MR data are acquired is limited by MR physics and physiological 
constraints on RF energy deposition. Compressed sensing MRI (either blind, or with known 
sparse model) has become quite popular in recent years, and it alleviates some of the aforemen¬ 
tioned problems by enabling accurate reconstruction of MR images from highly undersampled 
measurements. 

In this work, we illustrate the usefulness of the proposed transform-based BCS schemes 
for magnetic resonance image reconstruction from highly undersampled k-space data. We 
show that our adaptive transform-based BCS provides better image reconstruction quality 
compared to prior methods that involve fixed image-based, or patch-based sparsifying trans¬ 
forms. Importantly, transform-based BCS is shown to be more than lOx faster than synthesis 
dictionary-based BCS [64] for reconstructing 2D MRI data. The speedup is expected to be 
much higher when considering 3D or 4D MR data. These advantages make the proposed 
scheme more amenable for adoption for clinical use in MRI. 

1.5. Organization. The rest of this paper is organized as follows. Section 2 describes our 
transform learning-based blind compressed sensing formulations and their properties. In Sec¬ 
tion 3, we derive efficient block coordinate descent algorithms for solving the BCS Problems, 
and discuss the algorithms’ computational costs. In Section 4, we present novel convergence 
guarantees for our algorithms. The proof of convergence is provided in the Appendix. Section 
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5 presents experimental results demonstrating the convergence behavior, performance, and 
computational efficiency of the proposed scheme for the MRI application. In Section 6, we 
conclude with proposals for future work. 

2. Problem Formulations. 

2.1. Synthesis dictionary-based Blind Compressed Sensing. The compressed sensing- 
based image reconstruction Problem (1.3) can be viewed as a particular instance of the fol¬ 
lowing constrained regularized inverse problem, with (^(x) = AUTxllj^, zz = 1, and S = 


(2.1) min V\\Ax — y\\\ + C,{x) 

x&S 

However, CS image reconstructions employing fixed, non-adaptive sparsifying transforms typ¬ 
ically suffer from many artifacts at high undersampling factors [64] . Blind compressed sensing 
allows for the sparse model to be directly adapted to the object(s) being imaged. For example, 
the overlapping patches of the underlying image may be assumed to be sparse in a certain 
adaptive model. In prior work [64], the following patch-based dictionary learning regularizer 
was used within Problem (2.1) along with 5 = 

N 

(2.2) C{x) =mm'^\\PjX - Dhj\\\ s.t. || 4||2 = 1VA:, ||6j||Q<sVj. 

j=i 

The resulting synthesis dictionary based BCS formulation is as follows 

N 

(PO) nun ||Hx - 2/II2^ - T)6j||2 s.t. ||4||2 = 1VA:, ||6j||o < s V j. 

j=i 

Here, > 0 is a weight for the measurement fidelity term {\\Ax — yll^), and Pj € 
represents the operator that extracts a y/n x ^/n 2D patch as a vector PjX G C” from the 
image x. A total of N overlapping 2D patches are used. The synthesis model allows each 
patch PjX to be approximated by a linear combination Dbj of a small number of columns 
from a dictionary D G where bj G is sparse. The columns of the learnt dictionary 

(represented by 4 ) 1 < k < K) in (PO) are additionally constrained to be of unit norm in 
order to avoid the scaling ambiguity [33]. The dictionary, and the image patch, are assumed 
to be much smaller than the image {n,K <C p) in (PO). Problem (PO) thus enforces all the 
N (a typically large number) overlapping image patches to be sparse in some dictionary D, 
which can be considered as a strong yet flexible prior on the underlying image. 

We use B G to denote the matrix that has the sparse codes of the patches bj as its 

columns. Each sparse code is permitted a maximum sparsity level of s <C n in (PO). Although 
a single sparsity level is used for all patches in (PO) for simplicity, in practice, different sparsity 
levels may be allowed for different patches (for example, by setting an appropriate error 
threshold in the sparse coding step of optimization algorithms [64]). For the case of MRI, 
the sensing matrix A in (PO) is Fu G the undersampled Fourier encoding matrix [64]. 

The weight v in (PO) is set depending on the measurement noise standard deviation a as 
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u = ^, where 0 is a positive constant [64]. In practice, if the noise level is unknown, it may 
be estimated. 

Problem (PO) is to learn a patch-based synthesis sparsifying dictionary {n,K <C p), and 
reconstruct the image simultaneously from highly undersampled measurements. As discussed 
before, we have previously shown significantly superior image reconstructions for MRI using 
(PO), as compared to non-adaptive compressed sensing schemes that solve Problem (1.3). 
However, the BCS Problem (PO) is both non-convex and NP-hard. Approximate iterative 
algorithms for (PO) (e.g., the DLMRI algorithm [64]) typically solve the synthesis sparse coding 
problem repeatedly, which makes them computationally expensive. Moreover, no convergence 
guarantees exist for the algorithms that solve (PO). 

2.2. Sparsifying Transform-based Blind Compressed Sensing. 

2.2.1. Problem Formulations with Sparsity Constraints. In order to overcome some of 
the aforementioned drawbacks of synthesis dictionary-based BCS, we propose using the spar¬ 
sifying transform model in this work. Sparsifying transform learning has been shown to be 
effective and efficient in applications, while also enjoying good convergence guarantees [73]. 
Therefore, we use the following transform learning regularizer [70] 

N 

(2.3) C(^) = \\WPjX — bj II 2 + A Q{W) s.t. JjBjjQ < s 

i=i 

along with the constraint set S = {x € : \\x \\2 < C} within Problem (2.1) to arrive at the 

following adaptive sparsifying transform-based BCS problem formulation 

N 

(PI) + + ll-^llo ^ \\x\\ 2 <C. 

’ j=i 

Here, W G g. square sparsifying transform for the patches of the underlying 

image. The penalty jjVPPjX — bjW^ denotes the sparsification error (transform domain residual) 
[70] for the patch, with bj denoting the transform sparse code. The sparsity term JjBjjQ = 
ll^jllo counts the number of non-zeros in the sparse code matrix B. Notice that the 
sparsity constraint in (PI) is enforced on all the overlapping patches, taken together. This is 
a way of enabling variable sparsity levels for each specific patch. The constraint JJxjjg < C 
with C > 0 in (PI), is to enforce any prior knowledge on the signal energy (or, range). For 
example, if the pixels of the underlying image take intensity values in the range 0 — 255, 
then C = 255y^ is an appropriate bound. The function Q{W) : 1 —)• M in Problem 

(PI) denotes a regularizer for the transform, and the weight A > 0. Notice that without an 
additional regularizer, VF = 0 is a trivial sparsifier for any patch, and therefore, W = 0, bj = 0 
Vj, X = A^y (assuming this x satisfies jjxjj 2 < C) with (•)f denoting the pseudo-inverse, would 
trivially minimize the objective (without the regularizer Q{W)) in Problem (PI). 

Similar to prior work on transform learning [70, 67], we set Q{W) = —log jdetkFj -|- 
0.5||iy||^ as the regularizer in the objective to prevent trivial solutions. The — log jdetkFj 
penalty eliminates degenerate solutions such as those with repeated rows. The UkF]]^ penalty 
helps remove a ‘scale ambiguity’ in the solution [70], which occurs when the optimal solution 
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satisfies an exactly sparse representation, i.e., the optimal {x,W,B) in (PI) is such that 
WPjX = bj V j, and ||i?||o < s. In this case, if the ||IP||^ penalty is absent in (PI), the 
optimal (VP, B) can be scaled by /3 G C, with |/3| —)• oo, which causes the objective to decrease 
unbounded. 

The — log |detVP| and 0.5||VP||^ penalties together also additionally help control the 
condition number k{W) and scaling of the learnt transform. If we were to minimize only the 
Q{W) regularizer in Problem (PI) with respect to W, then the minimum is achieved with 
a W that is unit conditioned, and with spectral norm (scaling) of 1 [70], i.e., a unitary or 
orthonormal transform W. Thus, similar to Corollary 2 in [70], it is easy to show that as 
A —?■ oo in Problem (PI), the optimal sparsifying transform(s) tends to a unitary one. In 
practice, transforms learnt via (PI) are typically close to unitary even for finite A. Adaptive 
well-conditioned transforms (small k{W) > 1) have been previously shown to perform better 
than adaptive (strictly) orthonormal ones in some scenarios in image representation, or image 
denoising [70, 67]. 

In this work, we set A = \qN in (PI), where Aq > 0 is a constant. This setting allows A 
to scale with the size of the data (i.e., total number of patches). In practice, the weight Aq 
needs to be set according to the expected range (in intensity values) of the underlying image, 
as well as depending on the desired condition number of the learnt transform. The weight v 
in (PI) is set similarly as in (PO). 

When a unitary sparsifying transform is preferred, the Q{W) regularizer in (PI) (and in 
(2.3)) could instead be replaced by the constraint W^W = I, where I denotes the identity 
matrix, yielding the following formulation 

N 

(P2) min u\\Ax — y\\ 2 + '^^\\WPjX — bjW^ s.t. W^W = I, [[BHq < s, ||a ;||2 < C. 

i=i 

The unitary sparsifying transform case is special, in that Problem (P2) is also a unitary 
synthesis dictionary-based blind compressed sensing problem, with denoting the synthesis 
dictionary. This follows from the identity [[lyPjX — bj \\2 = \\PjX — W^bj\\ 2 , for unitary W. 

2.2.2. Properties of Transform BCS Formulations - Identifiability and Uniqueness. 

The following simple proposition considers an “error-free” scenario and establishes the global 
identifiability of the underlying image and sparse model in BCS via solving the proposed 
Problems (PI) or (P2). 

Proposition 2.1. Let x £ with ||x ||2 < C, and let y = Ax with A G . Sup¬ 
pose that W G is a unitary transform that sparsifies the collection of patches of x as 

ll^-Pj^llo — '®- Further, let B denote the matrix that has WPjX as its columns. Then, 
(x, VV^, B) is a global minimizer of both Problems (PI) and (P2), i.e., it is identifiable by solving 
these problems. 

Proof. : For the given {x,W,B), the terms Ylf=i W'^Fjx — bjW^ and ||Ax — 7 /II 2 in (PI) 
and (P2) each attain their minimum possible value (lower bound) of zero. Since W is unitary, 
the penalty QiW) in (PI) is also minimized by the given W. Notice that the constraints in 
both (PI) and (P2) are satisfied for the given {x,W,B). Therefore, this triplet is feasible for 
both problems and achieves the minimum possible value of the objective in both cases. Thus, 
it is a global minimizer of both (PI) and (P2). ■ 
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Thus, when “error-free” measurements are provided, and the patches of the underlying 
image are exactly sparse (as defined by the constraint in (PI)) in some unitary transform. 
Proposition 2.1 guarantees that the image as well as the model are jointly identifiable by 
solving (i.e., finding global minimizers in) (PI) (or, (P2)). 

An interesting topic, which we do not fully pursue here pertains to the condition(s) under 
which the underlying image in Proposition 2.1 is the unique minimizer of the proposed BCS 
problems. The proposed problems do admit an equivalence class of solutions/minimizers with 
respect to the transform W and the set of sparse codes Given a particular minimizer 

{x,W,B) of (PI) or (P2), we have that {x,QW,QB) is another equivalent minimizer for all 
sparsity-preserving unitary matrices 0, i.e., 0 such that 0^0 = I and Yl,j ll®^jllo — 
example, 0 can be a row permutation matrix, or a diagonal ±1 sign matrix. Importantly 
however, the minimizer with respect to x in (PI) or (P2) is invariant to the modification of 
{W,B) by sparsity-preserving unitary matrices 0, i.e., the optimal x in (PI) or (P2) remains 
the same for all such choices of {QW, QB). 

We note that by imposing additional structure on W in our transform-based BCS for¬ 
mulations, one can derive conditions for the uniqueness of the minimizers. Assume a global 
minimizer (x, W, B) exists in (P2) satisfying the conditions (error-free scenario) in Proposition 
2.1. Then, for example, when the unitary transform W in (P2) is further constrained to be 
doubly sparse, i.e., W = with S a sparse matrix and a known matrix (e.g., DCT, or 
Wavelet ^h), then because is an equivalent (doubly sparse) synthesis dictionary 

(corresponding to the transform VP), the uniqueness conditions (involving the spark condi¬ 
tion) proposed in prior work on synthesis dictionary-based BCS [30] (Section V-A of [30]) can 
be extended to the transform-based setting here. A detailed analysis and description of such 
uniqueness results will be presented elsewhere. 

2.2.3. Problem Formulations with Sparsity Penalties. While Problem (PI) involves a 
sparsity constraint, an alternative version of Problem (PI) is obtained by replacing the £o 
sparsity constraint with an (.q penalty in the objective (and in (2.3)), in which case we have 
the following optimization problem 

N 

(P3) min \\WPjX — bj\\\-\-v\\Ax — y\\\-\-XQ{W)-\-rj^ \\B\\q s.t. IJxjjo < C. 

x,W,B ^^ 

1=1 

where with r] > Q, denotes the weight for the sparsity penalty. 

A version of Problem (P3) (without the £2 constraint) has been used very recently in 
adaptive tomographic reconstruction [54, 55]. However, it is interesting to note that in the 
absence of the llxjj 2 < C condition, the objective in (P3) is actually non-coercive. To see this, 
consider W = I and = xq + I3z, where xq is a particular solution to y = Ax, /3 G M, and 
z G M{A) with M{A) denoting the null space of A. For this setting, as /3 —)• 00 with the 
j**^ sparse code in (P3) set to WPjXp, it is obvious that the objective in (P3) remains finite, 
thereby making it non-coercive. The energy constraint on x in (P3) restricts the set of feasible 


^In the remainder of this work, when certain indexed variables are enclosed within braces, it means that 
we are considering the set of variables over the range of all the indices. 
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images to a compact set, and alleviates potential problems (such as unbounded iterates within 
a minimization algorithm) due to the non-coercive objective. 

While a single weight rf is used for the sparsity penalty ||i?||Q = II^jIIq (P3), one 

could also use different weights 'qj for the sparsity penalties ||6jHQ corresponding to different 
patches, if such weights are known, or estimated. 

Just as Problem (P3) is an alternative to Problem (PI), we can also obtain a corresponding 
alternative version (denoted as (P4)) of Problem (P2) by replacing the sparsity constraint with 
a penalty. Although, in the rest of this paper, we consider Problems (P1)-(P3), the proposed 
algorithms and convergence results in this work easily extend to the case of (P4). 

2.3. Extensions. While the proposed sparsifying transform-based BCS problem formu¬ 
lations are for the (extreme) scenario when the CS measurements corresponding to a single 
image are provided, these formulations can be easily extended to other scenarios too. For 
example, when multiple images (or frames, or slices) have to be jointly reconstructed using a 
single adaptive (spatial) sparsifying transform, then the objectives in Problems (P1)-(P3) for 
this case are the summation of the corresponding objective functions for each image. In appli¬ 
cations such as dynamic MRI (or for example, compressive video), the proposed formulations 
can be extended by considering adaptive spatiotemporal sparsifying transforms of 3D patches 
(cf. [80] that extends Problem (PO) in such a way to compressed sensing dynamic MRI). 
Similar extensions are also possible for higher-dimensional applications such as 4D imaging. 

3. Algorithm and Properties. 

3.1. Algorithm. Here, we propose block coordinate descent-type algorithms to solve the 
proposed transform-based BCS problem formulations (P1)-(P3). Our algorithms alternate 
between solving for the sparse codes {bj} {sparse coding step), transform W {transform update 
step), and image x {image update step), with the other variables kept fixed. One could also 
alternate a few times between the sparse coding and transform update steps, before performing 
one image update step. In the following, we describe the three main steps in detail. We 
show that each of the steps has a simple solution that can be computed cheaply in practical 
applications such as MRI. 

3.1.1. Sparse Coding Step. The sparse coding step of our algorithms for Problems (PI) 
and (P2) involves the following optimization problem 

N 

(3.1) min IlkFi-’jX — H 2 s.t. ||H||g < s. 

^ i=i 

Now, let Z G igg matrix with the transformed (vectorized) patches WPjx as its 

columns. Then, using this notation, Problem (3.1) can be rewritten as follows, where ||•||^ 
denotes the standard Frobenius norm. 

(3.2) nnn||Z —R||^ s.t. ||B||q < s. 

The above problem is to project Z onto the non-convex set {R G : ||R||q < s} of matrices 

that have sparsity < s, which we call the s-io ball. The optimal projection B is easily computed 
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by zeroing out all but the s coefficients of largest magnitude in Z. We denote this operation 
hy B = Hs{Z), where Hs{-) is the corresponding projection operator. In case, there is more 
than one choice for the s elements of largest magnitude in Z, then Hs[Z) is chosen as the 
projection for which the indices of these s elements are the lowest possible in lexicographical 
order. 

In the case of Problem (P3), the sparse coding step involves the following unconstrained 
(and non-convex) optimization problem 

(3.3) \\Z — B\\^p + rf'\\B \\q 

The optimal solution B in this case is obtained as B = H}^{Z), with the hard-thresholding 
operator H^{-) defined as follows, where the subscript ij indexes matrix entries {i for row and 
j for column). 


(3.4) 



f 0 , \Zij\ < T] 

1 ! I I — ^ 


The optimal solution to Problem (3.3) is not unique when the condition \Zij\ = rj is satisfied 
for some i, j (cf. Page 3 of [73] for a similar scenario and an explanation). The definition in 

(3.4) chooses one of the multiple optimal solutions in this case. 

3.1.2. Transform Update Step. Here, we solve for W in the proposed formulations, with 
the other variables kept fixed. In the case of Problems (PI) and (P3), this involves the 
following optimization problem 


N 

(3.5) min ^ jjlTPjX - ^jj]^ + 0.5A ||IT||^ - Alog |detIT| 

^ j=i 


Now, let X G t)e the matrix with the vectorized patches PjX as its columns, and recall 

that B is the matrix of codes bj. Then, Problem (3.5) becomes 

(3.6) min \\WX - B\\l + 0.5X\\W\\l - Xlog \detW\ 

w 


An analytical solution for this problem has been recently derived [67, 73], and is stated in the 
following proposition. It is expressed in terms of an appropriate singular value decomposition 
(SVD). We let Mz denote the positive dehnite square root of a positive definite matrix M. 

Proposition 3.1. Given X e , B G andX> 0, factorize XX^+ 0.5X1 as LL^, 

with L G Further, let L~^XB^ have a full SVD ofVTjR^. Then, a global minimizer 

for the transform update step (3.6) is 

(3.7) IP = 0.572 + (s2 + 2A/)^)p^L-^ 

The solution is unique if and only if XB^ is non-singular. Furthermore, the solution is 
invariant to the choice of factor L. 
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Proof. : See the proof of Proposition 1 of [73] , particularly the discussion following that 
proof. ■ 

The factor L in Proposition 3.1 can for example be the factor L in the Cholesky fac¬ 
torization XX^ + 0.5A/ = LL^, or the Eigenvalue Decomposition (EVD) square root of 
XX^ -\-0.5XI. The closed-form solution (3.7) is nevertheless invariant to the particular choice 
of L [73]. Although in practice both the SVD and the square root of non-negative scalars are 
computed using iterative methods, we will assume in the convergence analysis in this work, 
that the solution (3.7) (as well as later ones that involve such computations) is computed 
exactly. In practice, standard numerical methods are guaranteed to quickly provide machine 
precision accuracy for the SVD or other (aforementioned) computations. 

In the case of Problem (P2), the transform update step involves the following problem 

(3.8) nun IIITV - .B||^ s.t.W^W = I. 

The solution to the above problem can be expressed as follows (see [67], or Proposition 2 of 
[73]). 

Proposition 3.2. Given X E and B E let XB^ have a full SVD of UT,V^. 

Then, a global minimizer in (3.8) is 

(3.9) W = VU^ 

The solution is unique if and only if XB^ is non-singular. 

3.1.3. Image Update Step: General Case. In this step, we solve Problems (P1)-(P3) for 
the image x, with the other variables hxed. This involves the following optimization problem 

N 

(3.10) min '^^\\WPjX — bjW^ + n \\Ax — y \\2 s.t. ||x ||2 < C*. 

j=i 

Problem (3.10) is a least squares problem with an £2 (alternatively, squared £ 2 ) constraint [31]. 
It can be solved exactly by using the Lagrange multiplier method [31]. The corresponding 
Lagrangian formulation is 

N 

(3.11) min \\WPjX — bjW^ v \\Ax — y\\\ + /r (^||x ||2 — 

1=1 


where /z > 0 is the Lagrange multiplier. The solution to (3.11) satishes the following Normal 
Equation 

N 

(3.12) {G + u A^A + yLl)x = Y^ PjW^bj + iz A^y 

i=i 


where 


N 

G = ^PjW^WPj 

i=i 


(3.13) 
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and (•)^ (matrix transpose) is used instead of (•)^ above for real matrices. The solution to 
(3.12) is unique for any /r > 0 because matrix G is positive-definite. To see why, consider any 
z G CP. Then, we have z^Gz = which is strictly positive unless WPjZ = 0 

V j. Since the W in our algorithm is ensured to be invertible, we have that WPjz = 0 V j if 
and only Pjz = 0\/ j, which implies (assuming that the set of patches in our formulations 
covers all pixels in the image) that z = 0. This ensures G 0. The unique solution to (3.12) 
can be found by direct methods (for small-sized problems), or by conjugate gradients (CG). 

To solve the original Problem (3.10), the Lagrange multiplier ^ in (3.12) must also 
be chosen optimally. This is done by first computing the EVD of the p x p matrix 
G +nA^A as . Since G 0, we have that S 0. Then, defining z = 

{^^=iPjW^bj + uA^y^, we have that (3.12) implies U^x = {Tj + pl)~^ z. There¬ 
fore, 


(3.14) 



U^x 


2 

2 


E 


(Sjj -|- /i) 


= /(/^) 


where Zi above denotes the entry of vector z. If /(O) < G^, then /i = 0 is the optimal 
multiplier. Otherwise, the optimal p, > 0. In the latter case, since the function f{p) in (3.14) 
is monotone decreasing for p > 0, and /(O) > G^, there is a unique multiplier p > 0 such 
that f{p) — G^ = 0. The optimal p is found by using the classical Newton’s method (or, 
alternatively [25]), which in our case is guaranteed to converge to the optimal solution at a 
quadratic rate. Once the optimal p is found (to within machine precision), the unique solution 
to the image update Problem (3.10) is t/ (S -|- pl)~^ z, coinciding with the solution to (3.12) 
with p = p. 

In practice, when a large value (or, loose estimate) of G is used (for example, in our 
experiments later in Section 5), the optimal solution to (3.10) is typically obtained with the 
setting p = 0 for the multiplier in (3.11). In this case, the unique minimizer of the objective 
in (3.10) (e.g., obtained with CG) directly satisfies the constraint. Therefore, the additional 
computations (e.g., EVD) to hnd the optimal p can be avoided in this case. Other alternative 
ways to find the solution to (3.10) when the optimal /i 7 ^ 0, without the EVD computation, 
include using the projected gradient method, or solving (3.11) repeatedly (by CG) for various 
p (tuned in steps) until the ||a :||2 = G condition is satisfied. 

When employing CG (or, the projected gradient method), the structure of the various 
matrices in (3.12) can be exploited to enable efficient computations. First, we show that under 
certain assumptions, the matrix G in (3.12) is a 2D circulant matrix, i.e., a Block Circulant 
matrix with Circulant Blocks (abbreviated as BCCB matrix), enabling efficient computation 
(via FFTs) of the product Gx (used in CG). Second, in certain applications, the matrix A^A 
in (3.12) may have a structure (e.g., sparsity, Toeplitz structure, etc.) that enables efficient 
computation of A^Ax (used in CG). In such cases, the GG iterations can be performed 
efficiently. 

We now show the matrix G in (3.12) is a BCCB matrix. To do this, we make some 
assumptions on how the overlapping 2D patches are selected from the image(s) in our formu¬ 
lations. First, we assume that periodically positioned, overlapping 2D image patches are used. 
Furthermore, the patches that overlap the image boundaries are assumed to ‘wrap around’ 
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on the opposite side of the image [64]. Now, defining the patch overlap stride r to be the 
distance in pixels between corresponding pixel locations in adjacent image patches, it is clear 
that the setting r = 1 results in a maximal set of overlapping patches. When r = 1 (and 
assuming patch ‘wrap around’), the following proposition establishes that the matrix G is a 
Block Circulant matrix with Circulant Blocks. Let F £ Cp^p denote the full (2D) Fourier 
encoding matrix assumed normalized such that F = L 

Proposition 3.3. Let r = 1, and assume that all ‘wrap around’ image patches are included. 
Then, the matrix G = PjW^WPj in (3.12) is a BCCB matrix with eigenvalue decom¬ 
position F^TF, with r 0. 

Proof. : First, note that W^W = , where are the columns of the 

nxn identity matrix. Denote the i**^ row of W^W by hi. Then, the matrix eicf W^W is all 
zero except for its row, which is equal to hi. Let Gi = 'Yl!j=iPj {cieJW^W) Pj. Then, 

G = ELi G. 

Consider a vectorized image ^ G C^. Because the set of entries of GiZ is simply the set of 
inner products between hi and all the (overlapping and wrap around) patches of the image 
corresponding to z, it follows that applying Gi on 2D circularly shifted versions of the image 
corresponding to z produces correspondingly shifted versions of the output GiZ. Hence, Gi is 
an operator corresponding to 2D circular convolution. 

Now, it follows that each Gi is a BCCB matrix. Since G = ^11=1 Gi is a sum of BCCB 
matrices, it is therefore a BCCB matrix as well. Now, from standard results regarding BCCB 
matrices, we know that G has an EVD of F^TF, where F is a diagonal matrix of eigenvalues. 
It was previously established that G is positive-definite. Therefore, F 0 holds. ■ 

Proposition 3.3 states that matrix G in (3.12) is a BCCB matrix, that is diagonalizable 
by the Fourier basis. Therefore, G can be applied (via FFTs) on a vector (e.g., in CG) at 
0{plogp) cost. This is much lower than the 0{n^p) cost (n^ >> logp typically) of applying 
G = PjW^WPj directly using patch-based operations. We now show how the diagonal 

matrix F in Proposition 3.3 can be computed efficiently. First, in Proposition 3.3, in the case 
that VF is a unitary matrix (as in (P2)), the matrix G = Yl!j=iPjPj i® simply the identity 
scaled by a factor n [64]. In this case, P = nP Second, when W is non-unitary, let us assume 
that the Fourier matrix F is arranged so that its first column is the constant column (with 
entries = \/^/p). Then, the diagonal of F in this case is obtained efficiently (via FFTs) as 
y/pFai, where oi is the first column of G. Now, the first column of G can itself be easily 
computed by applying this operator (using simple patch-based operations) on z G CP that has 
a one in its first entry (corresponding to an image with a 1 in the top left corner) and zeros 
elsewhere. Note that since is extremely sparse, a\ is computed at a negligible cost. The 
aforementioned computation for hnding the diagonal of F is equivalent, of course, to hnding 
the spectrum of the operator corresponding to G, as the DFT of its impulse response. 

3.1.4. Image Update Step: Case of MRI. In certain scenarios, the optimal x in (3.10) 
can be found very efficiently. Here, we consider the case of Fourier imaging modalities, or 
more specifically, MRI, where A = Fu, the undersampled Fourier encoding matrix. In order 
to obtain an efficient solution for the x in (3.10), we assume that the k-space measurements 
in MRI are obtained by subsampling on a uniform Cartesian grid. We then show that the 
optimal multiplier fi and the corresponding optimal x can be computed without any EVD 
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computations, or CG, for MRI. 

Assume r = 1 , and that all ‘wrap around’ image patches are included in the formulation. 
Then, empowered with the diagonalization result of Proposition 3.3, we can simplify equation 
(3.12) for MRI, by rewriting it as follows 


N 

(3.15) {FGF^ + zy FF^FuF^ + ^/) Fx = F ^ PjW^bj + i/ FF^y 

i=i 


All p-dimensional vectors (vectorized images) in (3.15) are in Fourier or k-space. Vector 
FF^y £ represents the zero-filled (or, zero padded) k-space measurements. The matrix 
FF^FuF^ is a diagonal matrix consisting of ones and zeros, with the ones at those diagonal 
entries that correspond to sampled locations in k-space. By Proposition 3.3, for r = 1, the 
matrix FGF^ = P is diagonal. Therefore, the matrix pre-multiplying Fx in (3.15) is diagonal 
and invertible. Denoting the diagonal of P by 7 £ (all positive vector), Sq = FF^y, and 
S = F^^j^ Pj'W^bj, we have that the solution to (3.15) for fixed // is 


(3.16) 


Fx^ (kx ) ky) 


S{^kx ,ky ) 

Sikx So{kx ^ky^ 

'y{kx,ky)+u+fi 


, {kx, ky) ^ D 

, {kx, ky'] £ D 


where {kx, ky) indexes k-space locations, and D represents the subset of k-space that has been 
sampled. Equation (3.16) provides a closed-form solution to the Lagrangian Problem (3.11) 
for CS MRI, with Fxy{kx,ky) representing the optimal updated value (for a particular y) in 
k-space at location {kx,ky). 

The function /(/r) in (3.14) now has a simple form (no EVD needed) as follows 


(3.17) 


hk) 



\S{kx,ky)\'^ 

{kxj^m 




\S{kx,ky) + V SQ{kx,ky)^ 

(A:.X)en {l{kx,ky) + u + yf 


We check if /(O) < G^ first, before applying Newton’s method to solve f{y) = G^. The 
optimal X in (3.10) is obtained via a 2D inverse EFT of the updated Fxy in (3.16). 

The pseudocodes of the overall Algorithms Al, A 2 , and A3 corresponding to the BCS 
Problems (PI), (P2), and (P3) respectively, are shown below. An appropriate choice for 
the initial {W^,B^,x^^ in Algorithms A1-A3 would depend on the specific application. For 
example, could be the n x n 2D DOT matrix, x^ = A^y (normalized so that it satisfies 
|| 2;°||2 < C), and can be the minimizer of Problems (P1)-(P3) for these fixed and x^. 

3.2. Computational Cost. Algorithms Al, A 2 , and A3 involve the steps of sparse coding, 
transform update, and image update. We now briefly discuss the computational costs of each 
of these steps. 

First, in each outer iteration of our Algorithms Al and A3, the computation of the ma¬ 
trix XX^ -|- 0.5A/, where X has the image patches as its columns, can be done in 0{n?N) 
operations, where typically n N. The computation of the inverse square root L~^ requires 
only 0{n^) operations. 
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Transform-Based BCS Algorithms ^Al, A2, and A3 

Inputs: y - measurements obtained with sensing matrix A, s - sparsity, A - weight, v - 

weight, C - energy bound, M - number of inner iterations, J - number of outer iterations. 
Outputs: X - reconstructed image, W - adapted sparsifying transform, B - matrix with 

sparse codes of all overlapping patches as columns. 

Initial Estimates: 

For t = 1: J Repeat 

1) Form the matrix X with PjX^~^ as its columns. Compute L~^ = (^XX^ + 0.5A/) for 
Algorithms A1 and A3. Set B^ = B^~^. 

2) For 1 = 1: M Repeat 

(a) Transform Update Step: 

i. Set VYiR^ as the full SVD of L~^X for Algorithms A1 and A3, or the 

full SVD of X for Algorithm A2. 

ii. = 0.5R + (S2 -h 2A/) for Algorithms A1 and A3, or = RV^ 

for Algorithm A2. 

(b) Sparse Coding Step: & = Hg {W^X) for Algorithms A1 and A2, or & = 

(W^X) for Algorithm A3. 

End 

3) Set and R* = B^. Set 6^- as the column of B^ y j. 

4) Image Update Step: 

(a) For generic CS scheme, solve (3.12) for x* with /r = 0, by linear CG. If ||a ^*||2 > C, 

i. Compute as EVD of Pj W^Pj + i/ A. 

ii. Compute 2 : = {^^=iPj + u A^y^. 

iii. Use Newton’s method to find u. such that flu) = in (3.14). 

iv. x‘ = u {j: + fiiy^ z. 

(b) For MRI, do the following 

i. Compute the image c = Pj (W^)^ bj. S FFT{c). 

ii. Compute ai as the hrst column of y,f=i Pj W^Pj. 

iii. Set 7 •(— y/p X FFT{ai). 

iv. Compute /(O) as per (3.17). 

V. If /(O) < set y = 0. Else, use Newton’s method to solve /(p) = for fl. 
vi. Update S to be the right hand side of (3.16) with n = fi. x^ = IFFT{S). 

End 


The cost of the sparse coding step in our algorithms is dominated by the computation of the 


®The superscripts t and I denote the main iterates, and the iterates in the inner alternations between 
transform update and sparse coding, respectively. The FFT and IF FT denote the fast implementations of the 
normalized 2D DFT and 2D IDFT. For MRI, r = 1, and the encoding matrix F is assumed normalized, and 
arranged so that its first column is the constant DC column. In Step 4a, although we list the image update 
method involving EVD, an alternative scheme is one mentioned in the text in Section 3.1.3, involving only CG. 







16 


S. Ravishankar and Y. Bresler 


matrix Z = WX in (3.2) (for Algorithms Al, A2) or (3.3) (for Algorithm A3), and therefore 
scales as 0{n'^N). Notably, the projection onto the s-£q ball in (3.2) costs only 0{nN log N) 
operations, when employing sorting [70], with logiV <C n typically. Alternatively, in the case 
of (3.3), the hard thresholding operation costs only 0{nN) comparisons. 

The cost of the transform update step of our algorithms is dominated by the computation 
of the matrix XB^. Since B is sparse, XB^ is computed with an^N multiply-add operations, 
where a < 1 is the fraction of non-zeros in B. 

The cost of the image update step in our algorithms depends on the specific application 
(i.e., the specific structure of A, etc.). Here, we discuss the cost of the image update 
step discussed in Section 3.1.4, for the specific case of MRI. We assume r = 1, and that the 
patches ‘wrap around’, which implies that N = p (i.e., number of patches equals number of 
image pixels). The computational cost here is dominated by the computation of the term 
in the normal equation (3.12), which takes 0{'n?N) operations. On the other 
hand, the various FFT and IFFT operations cost only 0(A^log A^) operations, where logA^ <C 

typically. The Newton’s method to compute the optimal multiplier fi is only used when 
p = 0 is non-optimal. In the latter case, Newton’s method takes up 0{NJ) operations, with 
J being the number of iterations (typically small, and independent of n) of Newton’s method. 

Based on the preceding arguments, it is easy to observe that the total cost per (outer) 
iteration of the Algorithms A1-A3 scales (for MRI) as 0{n?NM). Now, the recent synthesis 
dictionary-based BCS method called DLMRI [64] learns a dictionary D £ from CS MRI 

measurements by solving Problem (PO). For this scheme, the computational cost per outer 
iteration scales as 0{NKnsJ) [70], where J is the number of (inner) iterations of dictionary 
learning (using the K-SVD algorithm [2]), and the other notations are the same as in (PO). 
Assuming that K oc n, and that the synthesis sparsity s oc n, we have that the cost per 
iteration of DLMRI scales as 0{rfiNJ). Thus, the per-iteration computational cost of the 
proposed BCS schemes is much lower (lower in order by factor n assuming M ^ J) than 
that for synthesis dictionary-based BCS. This gap in computations is amplified for higher¬ 
dimensional imaging applications such as 3D or 4D imaging, where the size of the 3D or 4D 
patches is typically much bigger than in the case of 2D imaging. 

As illustrated in our experiments in Section 5, the proposed BCS algorithms converge 
in few iterations in practice. Therefore, the per-iteration computational advantages over 
synthesis dictionary-based BCS also typically translate to a net computational advantage in 
practice. 

4. Convergence Results. Here, we present convergence guarantees for Algorithms Al, 
A2, and A3, that solve Problems (PI), (P2), and (P3), respectively. These problems are 
highly non-convex. Notably they involve either an Iq penalty or constraint for sparsity, a non- 
convex transform regularizer or constraint, and the term [[ITPjX — that is a non-convex 
function involving the product of unknown matrices and vectors. The proposed algorithms for 
Problems (P1)-(P3) are block coordinate descent-type algorithms. We previously established 
in Proposition 2.1, the (noiseless) identifiability of the underlying image by solving the pro¬ 
posed problems (specifically, (PI) or (P2)). We are now interested in understanding whether 
the proposed algorithms converge to a minimizer of the corresponding problems, or whether 
they possibly get stuck in non-stationary points. Due to the high degree of non-convexity in- 
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volved here, standard results on convergence of block coordinate descent methods (e.g., [77]) 
do not apply here. 

Very recent works on the convergence of block coordinate descent-type algorithms (e.g., 
[82], or the Block Coordinate Variable Metric Forward-Backward algorithm [14]) prove con¬ 
vergence of the iterate sequence (for specific algorithm) to a critical point of the objective. 
However, these works make numerous assumptions, some of which can be easily shown to 
be violated for the proposed formulations (for example, the term \\WPjX — bjW^ in the 

objectives of our formulations, although differentiable, violates the L-Lipschitzian gradient 
property described in Assumption 2.1 of [14]). 

In fact, in certain simple scenarios, one can easily derive non-convergent iterate sequences 
for the Algorithms A1-A3. Non-convergence mainly arises for the transform or sparse code 
sequences (rather than the image sequence) due to the fact that the optimal solutions in 
the sparse coding or transform update steps may be non-unique. For example, in the trivial 
case when y = 0, the image a; = 0 is the unique minimizer in the proposed BCS problems. 
Hence, if = 0, then the iterates in Algorithms A1-A3 easily satisfy x* = 0 V t. Since the 
patches of x* are all zero, then any unitary matrix would be an (non-unique) optimal sparsifier 
in the transform update step of the algorithms, no matter how many inner alternations are 
performed between sparse coding and transform update within each outer iteration of the 
algorithms (the sparse codes are always 0 in this case). Hence, the |hF*} sequence can be any 
sequence of unitary transforms in this case (a limit need not exist). 

In this work, we provide convergence results for the proposed BCS approaches, where the 
only assumption is that the various steps in our algorithms are solved exactly. (Recall that 
machine precision is typically guaranteed in practice.) Unlike prior works on dictionary-based 
blind compressed sensing [64, 80], wherein the update steps such as the Iq “norm” based syn¬ 
thesis sparse coding are only solved approximately, our analysis here for the transform based 
BCS approaches makes use of the explicit solutions for the update steps in our algorithms, to 
prove convergence. 

4.1. Preliminaries. We first list some definitions that will be used in our analysis. 

Definition 4.1. For a function ^ i—>■ (—oo,-t-oo], its domain is defined as domcj) = 
{ 2 ; E : (j){z) < -|-oo}. Function (p is proper if domp is nonempty. 

Next, we define the notion of Frechet sub-differential for a function as follows [74, 49]. 
The norm and inner product notations used below correspond to the euclidean £2 settings. 

Definition 4.2. Let p : (—oo,-|-oo] be a proper function and let z E dom(/>. The 

Frechet sub-differential of the function f at z is the following set: 

(4.1) d(j){z) 4 |/i E M'' : ^liminf {fib) - (f){z) -{b-z, h)) > 0| 

If z ^ domc/i, then d(p{z) = 0. The sub-differential of f at z is defined as 

(4.2) dfiz) = {h gR'I :3zk ^ z, pizk) fiz), hk E d(l){zk) h] ■ 


A necessary condition for z E to be a minimizer of the function (/> : M*? 1 —)• (— 00 , -t- 00 ] is 
that z is a critical point of p, i.e., 0 E dp{z). If is a convex function, this condition is also 
sufficient. Critical points can be thought of as “generalized stationary points” [74]. 
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We say that a sequence {«*} with a* € has an accumulation point a, if there is a 
subsequence that converges to a. 

4.2. Notations. Problems (PI) and (P2) have the constraint ||i?||Q < s, which can instead 

be added as a penalty in the respective objectives by using a barrier function (which 

takes the value +oo when the sparsity constraint is violated, and is zero otherwise). Problem 
(P2) also has the constraint W^W = /, which can be equivalently added as a penalty in the 
objective of (P2) by using the barrier function ip(W), which takes the value +oo when the 
unitary constraint is violated, and is zero otherwise. Finally, the constraint ||x ||2 < C in our 
formulations is replaced by the barrier function penalty x(^)- With these modifications, all 
the proposed problem formulations can be written in an unconstrained form. The objectives 
of (PI), (P2), and (P3), are then respectively denoted as 

N 

giW, B,x) = u \\Ax - y\\l + \\WP,x - bj\\l + A Q(W) + V'(B) + x{x) 

i=i 

N 

u{W, B,x) = u \\Ax - y\\l + ^ \\WPjX - bj\\l + ip{W) + V>(B) + x(x) 

j=i 
N 

v{W, B,x) = u Px - y\\l + Y, WWPjX - bj\\l + A Q(W) + rf ||B||o + x(x) 

i=i 

It is easy to see that (cf. [73] for a similar statement and justification) the unconstrained 
minimization problem involving the objective giyV, B, x) is exactly equivalent to the corre¬ 
sponding constrained formulation (PI), in the sense that the minimum objective values as 
well as the set of minimizers of the two formulations are identical. The same result also holds 
with respect to (P2) and u{W,B,x), or (P3) and v{W,B,x). 

Since the functions g, u, and v accept complex-valued (input) arguments, we will compute 
all derivatives or sub-differentials (Definition 4.2) of these functions with respect to the (real¬ 
valued) real and imaginary parts of the variables {W, B, x). Note that the functions g, u, 
and V are proper (we set the negative log-determinant penalty to be -|-oo wherever det W = 0) 
and lower semi-continuous. For the Algorithms A1-A3, we denote the iterates (outputs) in 
each outer iteration t by the set (kF*, B^,x^). 

For a matrix B, we let Pj{H) denote the magnitude of the largest element (magnitude- 
wise) of the matrix H. For some matrix E, = maxjj- \Eij\. Finally, Re{A) denotes the 

real part of some scalar or matrix A. 

4.3. Main Results. The following theorem provides the convergence result for Algorithm 
A1 that solves Problem (PI). We assume that the initial estimates {W^,B^,x^) satisfy all 
problem constraints. 

Theorem 4.3. Let {W^,B^ , X*} denote the iterate sequence generated by Algorithm Al 
with measurements y E C'" and initial {W^,B^,x^). Then, the objective sequence {< 7 *} 
with g^ = g(W^,B^,x^) is monotone decreasing, and converges to a finite value, say 
g* = g*(W^, B^, x^). Moreover, the iterate sequence is bounded, and all its 

accumulation points are equivalent in the sense that they achieve the exact same value g* of 


(4.3) 

(4.4) 

(4.5) 
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the objective. The sequence {a*| with a* = ||x* — II 2 , converges to zero. Finally, every 
aecumulation point {W,B,x) of is a eritieal point of the objective g satisfying 

the following partial global optimality eonditions 


(4.6) 

X Garg min g (IT, B, x) 

X 

(4.7) 

W Garg min g 1 W, B, x ) 


w ^ ’ 

(4.8) 

B Garg min g (W,B,xj 

Eaeh aecumulation point 
tions 

(IT, B, x) also satisfies the following partial heal optimality condi 

(4.9) 

g{W + dW,B + AB, x) >g{W, B, x) = g* 

(4.10) 

g{W, B + AB, X + Ax) >g{W, B, x) = g* 


The conditions each hold for all Ax G and all sufficiently small dW G satisfying 

for some e' > 0 that depends on the speeific W, and all AB G in the union 

of the following regions R1 and R2, where X G is the matrix with PjX, 1 < J < iV, as 

its columns. 

Rl. The half-space Re {tr{{WX - B)AB^}) < 0. 

R2. The local region defined by ||Ai?||^ < psiWX). 

Furthermore, i/||VFA||Q < s, then AB ean be arbitrary. 

Theorem 4.3 establishes that for each initial point {W^,B^,x^), the iterate sequence in 
Algorithm A1 converges to an equivalence class of accumulation points. Specihcally, every 
accumulation point corresponds to the same value g* = g*{W^,B^,x^) of the objective. The 
exact value of g* could vary with initialization. Importantly, the equivalent accumulation 
points are all critical points as well as at least partial minimizers of the objective g {W,B,x), 
in the following sense. Every accumulation point {W, B, x) is a partial global minimizer of 
g{W,B,x) with respect to each of W, B, and x, as well as a partial local minimizer of 
g{W,B,x) with respect to {W,B), and {B,x), respectively. Therefore, we have the following 
corollary to Theorem 4.3. 

Corollary 4.4. For each {W^,B^,x^), the iterate sequence in Algorithm A1 converges to 
an equivalence class of critical points, that are also partial minimizers satisfying (4.6), (4.7), 
(4.8), (4.9), and (4.10). 

Conditions (4.9) and (4.10) in Theorem 4.3 hold true not only for local (or small) per¬ 
turbations of the sparse code matrix (accumulation point) B, but also for arbitrarily large 
perturbations of the sparse codes in a half space, as defined by region Rl. Furthermore, the 
partial optimality condition (4.10) also holds for arbitrary perturbations Ax of x. 

Theorem 4.3 also says that ||x* —x *“^||2 —>■ 0. This is a necessary but not sufficient 
condition for convergence of the entire sequence {a:*}- 

The following corollary to Theorem 4.3 also holds, where ‘globally convergent’ refers to 
convergence from any initialization. 
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Corollary 4.5. Algorithm A1 is globally convergent to a subset of the set of critical points of 
the non-convex objective g {W,B,x). The subset includes all critical points {W,B,x), that are 
at least partial global minimizers of g{W,B,x) with respect to each of W, B, and x, as well 
as partial local minimizers of g{W,B,x) with respect to each of the pairs {W,B), and {B,x). 

Theorem 4.3 holds for Algorithm Al irrespective of the number of inner alternations 
M, between transform update and sparse coding, within each outer algorithm iteration. In 
practice, we have observed that a larger value of M (particularly in initial algorithm iterations) 
may enable Algorithm Al to be insensitive (for example, in terms of the quality of the image 
reconstructed) to the initial (even, badly chosen) values of and B^. 

The convergence results for Algorithms A2 or A3 are quite similar to that for Algorithm 
Al. The following two Theorems briefly state the results for Algorithms A3 and A2, respec¬ 
tively. 

Theorem 4.6. Theorem f.3 applies to Algorithm A3 and the corresponding objective 
v{W,B,x) as well, except that the set of perturbations AB € in Theorem j.3 is re¬ 

stricted to ||Ai?||^ < t ]/2 for Algorithm A3. 

Theorem 4.7. Theorem 4-3, except for the condition (4.9), applies to Algorithms A2 and 
the corresponding objective u(W,B,x) as well. 

Note that owing to Theorems 4.6 and 4.7, results similar to Corollaries 4.4 and 4.5 also 
apply for Algorithms A3 and A2, respectively. The proofs of the stated convergence theorems 
are provided in Appendix A. 

In general, the subset of the set of critical points to which each algorithm converges may 
be larger than the set of global minimizers (Section 2.2.2) in the respective problems. The 
question of the conditions under which the proposed algorithms converge to the (perhaps 
smaller) set of global minimizers in the proposed problems is open, and its investigation is left 
for future work. 

5. Numerical Experiments. 

5.1. Framework. Here, we study the usefulness of the proposed sparsifying transform- 
based blind compressed sensing framework for the CS MRI application The MR data 
used in these experiments are 512 x 512 complex-valued images shown (only the magnitude 
is displayed) in Fig. 1(a) and Fig. 2(a). The image in Fig. 1(a) was kindly provided 
by Prof. Michael Lustig, UC Berkeley, and is one image slice (with rich features) from a 
multislice data acquisition. The image in Fig. 2(a) is publicly available We simulate 
various undersampling patterns in k-space ® including variable density 2D random sampling ^ 
[76, 64], and Cartesian sampling with (variable density) random phase encodes (ID random). 

^We have also proposed another sparsifying transform-based BCS MRI method recently [71]. However, the 
latter approach involves many more parameters (e.g., error thresholds to determine patch-wise sparsity levels), 
which may be hard to tune in practice. In contrast, the methods proposed in this work involve only a few 
parameters that are relatively easy to set. 

®It can be downloaded from http://web.stanford.edu/class/ee369c/data/brain.mat. A phase-shifted 
version of the image was used in the experiments in [72]. 

®We simulate the k-space of an image x using the command fftshift(fft2(ifftshift(x))) in Matlab. Fig. 2(b) 
shows the k-space (only magnitude is displayed) of the reference in Fig. 2(a). 

^This sampling scheme is feasible when data corresponding to multiple image slices are jointly acquired, 
and the frequency encode direction is chosen perpendicular to the image plane. 
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We employ Problem (PI) and the corresponding Algorithm A1 to reconstruct images from 
undersampled measurements in the experiments here ®. Our reconstruction method is referred 
to as Transform Learning MRI (TLMRI). 

First, we illustrate the empirical convergence behavior of TLMRI. We also compare the 
reconstructions provided by the TLMRI method to those provided by the following schemes: 
1) the Sparse MRI method of Lustig et al [38], that utlilizes Wavelets and Total Variation 
as fixed sparsifying transforms; 2) the DLMRI method [64] that learns adaptive overcomplete 
sparsifying dictionaries; 3) the PANO method [63] that exploits the non-local similarities 
between image patches (similar to [15]), and employs a 3D transform to sparsify groups 
of similar patches; and 4) the PBDWS method [51]. The PBDWS method is a very recent 
partially adaptive sparsifying transform based compressed sensing reconstruction method that 
uses redundant Wavelets and trained patch-based geometric directions. It has been shown to 
perform better than the earlier PBDW method [62]. 

We simulated the performance of the Sparse MRI, PBDWS, and PANO methods using 
the software implementations available from the respective authors’ websites [37, 61, 59]. 
We used the built-in parameter settings in those implementations, which performed well in 
our experiments. Specihcally, for the PBDWS method, the shift invariant discrete Wavelet 
transform (SIDWT) based reconstructed image is used as the guide (initial) image [51, 61]. 
We employed the zero-filling reconstruction (produced within the PANO demo code [59]) as 
the initial guide image for the PANO method [63, 59]. 

The implementation of the DLMRI algorithm that solves Problem (PO) is also available 
online [68]. For DLMRI, image patches of size 6 x 6 (n = 36) are used, as suggested in [64] 
and a four fold overcomplete synthesis dictionary (AT = 144) is learnt using 25 iterations 
of the algorithm. A patch overlap stride of r = 1 is used, and 14400 (found empirically 
^°) randomly selected patches are used during the dictionary learning step (executed for 20 
iterations) of the DLMRI algorithm. Mean-subtraction is not performed for the patches prior 
to the dictionary learning step of DLMRI. A maximum sparsity level (of s = 7 per patch) is 
employed together with an error threshold (for sparse coding) during the dictionary learning 
step. The £2 error threshold per patch varies linearly from 0.48 to 0.15 over the DLMRI 


® Problem (P3) has been recently shown to be usefnl for adaptive tomographic reconstruction [54, 55]. 
The corresponding Algorithm A3 has the advantage that the sparse coding step involves the cheap hard 
thresholding operation, rather than the more expensive projection onto the s-£o ball (used in Algorithms Al 
and A2). We have observed that Algorithm A3 also works well for MRI. Problems (PI) and (P2) differ in that 
(P2) enforces unit conditioning of the learnt transform, whereas (PI) also allows for more general condition 
numbers. Algorithm A2 (for (P2)) involves slightly cheaper computations than Algorithm Al (for (PI)). In 
our experiments for MRI, we observed that well-conditioned transforms learnt via (PI) performed (in terms 
of image reconstrnction qnality) slightly better than strictly unitary learnt transforms. Therefore, we show 
results for (PI) in this work. We did not observe any dramatic difference in performance between the proposed 
methods in our experiments here. A detailed investigation of scenarios and applications where one of (PI), 
(P2), or (P3), performs the best (in terms of reconstruction quality compared to the others) is left for future 
work on specific applications. In this work, we have emphasized more the properties of these formulations, and 
the novel convergence guarantees of the corresponding algorithms. 

®The reconstrnction quality improves slightly with a larger patch size, but with a substantial increase in 
runtime. 

^^Using a larger training size (> 14400) during the dictionary learning step of the algorithm provides negli¬ 
gible improvement in final image reconstruction quality, while leading to increased runtimes. 
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iterations. These parameter settings (all other parameters are set as per the indications in 
the DLMRI-Lab toolbox [68]) were observed to work well for the DLMRI algorithm. 

The parameters for TLMRI (with Algorithms Al) are set to n = 36, r = 1 (with patch 
wrap around), v = 3.81, M = 1, Aq = 0.2, and C = 10^. The sparsity level s = 0.055xnA^ (this 
corresponds to an average sparsity level per patch of 0.055 xn, or 5.5% sparsity) , where N = 
512^, is used in our experiments except in Section 5.2, where s = 0.045xreA^ is used. The initial 
transform estimate is the (simple) patch-based 2D DCT [70], and the initial image x® is set 
to be the standard zero-filling Fourier reconstruction The initial sparse code settings are 
the solution to (3.1), for the given {W^, x^). Our TLMRI implementation was coded in Matlab 
version R2013a. Note that this implementation has not been optimized for efficiency. A link 
to the Matlab implementation is provided at http://www.ifp.illinois.edu/~yorani/. All 
simulations in this work were executed in Matlab. All computations were performed with an 
Intel Core i5 CPU at 2.5GHz and 4GB memory, employing a 64-bit Windows 7 operating 
system. 

Similar to prior work [64], we quantify the quality of MR image reconstruction using 
the peak-signal-to-noise ratio (PSNR), and high frequency error norm (HFEN) metrics. The 
PSNR (expressed in decibels (dB)) is computed as the ratio of the peak intensity value of 
some reference image to the root mean square reconstruction error (computed between image 
magnitudes) relative to the reference. In MRI, the reference image is typically the image 
reconstructed from fully sampled k-space data. The HFEN metric quantifies the quality of 
reconstruction of edges or finer features. A rotationally symmetric Laplacian of Gaussian 
(LoG) filter is used, whose kernel is of size 15 x 15 pixels, and with a standard deviation of 
1.5 pixels [64]. HEEN is computed as the £2 norm of the difference between the LoG filtered 
reconstructed and reference magnitude images. 

5.2. Convergence and Learning Behavior. In this experiment, we consider the reference 
image in Eig. 1(a). We perform four fold undersampling of the k-space space of the (peak 
normalized ^^) reference. The (variable density [64]) sampling mask is shown in Fig. 1(b). 
When the TLMRI algorithm is executed using the undersampled data, the objective function 
converges monotonically and quickly over the iterations as shown in Fig. 1(e). The changes 
between successive iterates ||x* —(Fig. 1(g)) converge towards 0. Such convergence 
was established by Theorem 4.3, and is indicative (a necessary but not suffficient condition) 
of convergence of the entire sequence {x*}. As far as the performance metrics are concerned, 
the PSNR metric (Fig. 1(f)) increases over the iterations, and the HFEN metric decreases, 

^^The sparsity level s is a regularization parameter in our framework that provides a trade-off between how 
much aliasing is removed over the algorithm iterations, and how much image information is kept or restored 
(i.e., not eliminated by the sparsity condition). We determined the sparsity level empirically in the experiments 
in this work. 

^^While we used the naive zero-filling Fourier reconstruction in our experiments here for simplicity, one 
could also use other better initializations for x such as the SIDWT based reconstructed image [51], or the 
reconstructions produced by recent methods (e.g., PBDWS, etc.). We have observed empirically that better 
initializations may lead to faster convergence of TLMRI, and TLMRI typically only improves the image quality 
compared to the initializations (assuming properly chosen sparsity levels). 

^®In practice, the data or k-space measurements can always be normalized prior to processing them for image 
reconstruction. Otherwise, the parameter settings for algorithms may need to be modified to account for data 
scaling. 
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Figure 1: Convergence of TLMRI with 5x undersampling: (a) Reference image; (b) sampling 
mask in k-space; (c) initial zero-filling reconstruction (26.93 dB); (d) TLMRI reconstruction 
(30.54 dB); (e) objective function (since the regularizer Q{W) > nj^ [70], we have subtracted 
out the constant offset of nA/2 from the objective values here); (f) PSNR and HFEN; (g) 
changes between successive iterates (||x* — x*“^|| 2 ), and (h) real (top) and imaginary (bottom) 
parts of the learnt VF, with the matrix rows shown as patches. 


indicating improving reconstruction quality over the algorithm iterations. These metrics also 
converge quickly. 

The initial zero-filling reconstruction (Fig. 1(c)) shows aliasing artifacts that are typical 
in the undersampled measurement scenario, and has a PSNR of only 26.93 dB. On the other 
hand, the final TLMRI reconstruction (Fig. 1(d)) is much enhanced (by 3.6 dB), with a 
PSNR of 30.54 dB. Since Algorithm Al is guaranteed to converge to the set of critical points 
of Problem (PI), the result in Fig. 1(d) suggests that, in practice, the set of critical points 
may in fact include images that are close to the true image. Note that our identihability result 
(Proposition 2.1) in Section 2.2 ensured global optimality of the underlying image only in a 
noiseless (or error-free) scenario. The learnt transform W {k{W) = 1.01) for this example is 
shown in Fig. 1(h). This is a complex valued transform. Both the real and imaginary parts 
of W display texture or frequency like structures, that sparsify the patches of the MR image. 
Our algorithm is thus able to learn this structure and reconstruct the image using only the 
nndersampled measurements. 

5.3. Comparison to Other Methods. In the following experiments, we execute the 
TLMRI algorithm for 40 iterations (all other parameters are set to the values mentioned 
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Image 

Sampling Scheme 

Undersampling 

Zero-filling 

Sparse MRI 

PBDWS 

PANO 

DLMRI 

TLMRI 

2 

2D Random 

4x 

25.3 

26.13 

31.69 

32.80 

33.01 

33.12 

1 

2D Random 

5x 

26.9 

27.84 

30.27 

30.37 

30.49 

30.56 

2 

2D Random 

7x 

25.3 

26.38 

31.10 

30.92 

31.70 

31.94 

2 

Cartesian 

4x 

28.9 

29.73 

31.67 

32.24 

32.67 

32.78 

2 

Cartesian 

7x 

27.9 

28.58 

31.11 

31.08 

30.91 

31.24 


Table 1: PSNRs corresponding to the Zero-filling, Sparse MRI [38], PBDWS [51], PANO [63], 
DLMRI [64], and TLMRI reconstructions, for various sampling schemes and undersampling 
factors. The best PSNRs are marked in bold. 


in Section 5.1). We also use a lower sparsity level (< 0.055 x nN) during the initial several 
algorithm iterations, which leads to faster convergence. We consider the complex-valued ref¬ 
erence images in Fig. 1(a) and Fig. 2(a) that are labeled as Image 1 and Image 2 respectively, 
and simulate variable density 2D random or Cartesian undersampling [64] of the k-spaces of 
these images. Table 1 lists the reconstruction PSNRs corresponding to the zero-filling. Sparse 
MRI, PBDWS, PANO, DLMRI, and TLMRI^^ reconstructions for various cases. 

The TLMRI algorithm is seen to provide the best PSNRs (analogous results were observed 
to typically hold with respect to the HFEN metric not shown in the table) for the various 
scenarios in Table 1. Significant improvements (up to 7 dB) are observed over the Sparse 
MRI method, that uses fixed sparsifying transforms. Moreover, TLMRI provides up to 1.4 
dB improvement in PSNR over the recent (partially adaptive) PBDWS method, and up to 
1 dB improvement over the recent non-local patch similarity-based PANO method. Finally, 
the TLMRI reconstruction quality is somewhat (up to 0.33 dB) better than DLMRI. This is 
despite the latter using a 4 fold overcomplete (i.e., larger or richer) dictionary. 

Fig. 2 shows the TLMRI reconstruction (Fig. 2(d)) of Image 2 for the case of 2D random 
sampling (sampling mask shown in Fig. 2(c)) and seven fold undersampling. The reconstruc¬ 
tion errors (i.e., the magnitude of the difference between the magnitudes of the reconstructed 
and reference images) for several schemes are shown in Figs. 2 (e)-(h). The error map for 
TLMRI clearly shows the smallest image distortions. Fig. 3 shows the TLMRI (Fig. 3(c)) 
and DLMRI (Fig. 3(e)) reconstructions and reconstruction error maps (Figs. 3 (d), (f)) for 
Image 2 with Cartesian sampling (sampling mask shown in Fig. 3(b)) and seven fold under¬ 
sampling. TLMRI provides a better reconstruction of image edges and better aliasing removal 
than DLMRI in this case. 

The average run times of the Sparse MRI, PBDWS, PANO, DLMRI, and TLMRI algo¬ 
rithms in Table 1 are 251 seconds, 797 seconds, 400 seconds, 3273 seconds, and 243 seconds, 
respectively. The PBDWS run time includes the time taken for computing the initial SIDWT 
based reconstruction or guide image [51] in the PBDWS software package [61]. The TLMRI 

observed that in Table 1, if we sampled vertical (instead of horizontal) lines for the Cartesian sampling 
patterns (by transposing the Cartesian sampling masks used in Table 1), the reconstructed images for TLMRI 
had PSNRs (32.52 dB and 31.05 dB respectively, for the 4x and 7x undersampling cases) similar to the horizontal 
lines sampling case. However, the learnt sparsifying transforms for the two cases had many dissimilar rows. 
This is not surprising since prior work on transform learning [73] has empirically shown that even the patches 
of a single image may admit multiple equally good sparsifying transforms, that may not be related by only row 
permutations or sign changes. 
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(e) (f) (g) (h) 


Figure 2: Results for 2D random sampling and 7x undersampling: (a) Reference image; (b) 
k-space of reference; (c) sampling mask in k-space; (d) TLMRI reconstruction (31.94 dB); 
(e) magnitude of PBDWS [51] reconstruction error; (f) magnitude of PANO [63] reconstruc¬ 
tion error; (g) magnitude of DLMRI [64] reconstruction error; and (h) magnitude of TLMRI 
reconstruction error. 


algorithm is thus the fastest one in Table 1, and provides a speedup of about 13.5x over 
the synthesis dictionary-based DLMRI, and a speedup of about 3.3x and 1.6x over the PB¬ 
DWS and PANO methods, respectively. Note that the speedups for TLMRI over PBDWS 
or PANO were obtained by comparing our unoptimized Matlab implementation of TLMRI 
against the MEX (or C) implementations of PBDWS and PANO. 

While our results show some (preliminary) potential for the proposed sparsifying 
transform-based blind compressed sensing framework (for MRI), a much more detailed inves¬ 
tigation will be presented elsewhere. Combining the proposed scheme with the patch-based 
directional Wavelets ideas [62, 51], or non-local patch similarity ideas [63, 48], or extending 
our framework to learning overcomplete sparsifying transforms (c.f., [81]) could potentially 
boost transform-based BCS performance further. 

6. Conclusions. In this work, we presented a novel sparsifying transform-based frame¬ 
work for blind compressed sensing. Our formulations exploit the (adaptive) transform domain 
sparsity of overlapping image patches in 2D, or voxels in 3D. The proposed formulations are 
however highly nonconvex. Our block coordinate descent-type algorithms for solving the pro¬ 
posed problems involve efficient update steps. Importantly, our algorithms are guaranteed to 
converge to the critical points of the objectives defining the proposed formulations. These crit- 

Another faster version of the PANO method is also publicly available [60]. However, we found that although 
this version has an average run time of only 40 seconds in Table 1, it also provides worse reconstruction PSNRs 
than [59] in Table 1. 
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Figure 3: Cartesian sampling with 7 fold undersampling: (a) Initial zero-filling reconstruction 
(27.9 dB); (b) sampling mask in k-space; (c) TLMRI reconstruction (31.24 dB); (d) magnitude 
of TLMRI reconstruction error; (e) DLMRI reconstruction (30.91 dB); and (f) magnitude of 
DLMRI reconstruction error. 


ical points are also guaranteed to be at least partial global and partial local minimizers. Our 
numerical examples showed the usefulness of the proposed scheme for magnetic resonance im¬ 
age reconstruction from highly undersampled k-space data. Our approach while being highly 
efficient also provides promising MR image reconstruction quality. The usefulness of the pro¬ 
posed blind compressed sensing methods in other inverse problems and imaging applications 
merits further study. A detailed investigation of the theoretical rate of convergence in our 
algorithms is also of potential interest. 

Appendix. Convergence Proofs. 

A.l. Proof of Theorem 4.3. In this proof, we let Hs{Z) denote the set of all optimal 
projections of Z G g_£^ g £^nxN . ||2?||q < i.e., Hs{Z) is the set of 

minimizers for the following problem. 

(A.l) Hs{Z) = argmin \\Z — B\\^p 

Let denote the iterate sequence generated by Algorithm Al with measure¬ 
ments y G and initial Assume that the initial {W^,B^,x^) is such that 

g [W^, B^, x^) is finite. Throughout this proof, we let A* be the matrix with PjX^ (1 < i < N) 
as its columns. The various results in Theorem 4.3 are now proved in the following order. 
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(i) The sequence of values of the objective in Algorithm A1 converges. 

(ii) The iterate sequence generated by Algorithm A1 has an accumulation point. 

(hi) All the accumulation points of the iterate sequence are equivalent in terms of their 
objective value. 

(iv) Every accumulation point of the iterates is a critical point of the objective g {W,B^x) 
satisfying (4.6), (4.7), and (4.8). 

(v) The difference between successive image iterates Ijx* — converges to zero. 

(vi) Every accumulation point of the iterates is a local minimizer of g {W, B, x) with respect 
to {B, x) or (VE, B). 

The following two Lemmas establish the convergence of the objective, and the boundedness 
of the iterate sequence. 

Lemma A.l. Let , X*} denote the iterate sequence generated by Algorithm Al with 

input y E C™ and initial {W^,B^,x^). Then, the sequence of objective function values 
{g (W^,B^,x^)'^ is monotone decreasing, and converges to a finite value g* = g* {W^, B^, x^). 

Proof. : Algorithm Al first alternates between the transform update and sparse coding 
steps (Step 2 in algorithm pseudocode), with fixed image x. In the transform update step, 
we obtain a global minimizer (i.e., (3.7)) with respect to W for Problem (3.5). In the sparse 
coding step too, we obtain an exact solution for B in Problem (3.2) as B = Hs{Z). Therefore, 
the objective function can only decrease when we alternate between the transform update 
and sparse coding steps (similar to the case in [73]). Thus, we have g , B^~^^, x*) < 

g{W\B\x^). 

In the image update step of Algorithm Al (Step 4 in algorithm pseudocode), we obtain 
an exact solution to the constrained least squares problem (3.10). Therefore, the objective in 
this step satisfies g (IT*+^, B^~^^< g (IT*+^, x*). By the preceding arguments, we 

have g (IT*+^, < g (W^,B^,x^), for every t. Now, every term, except XQ{W), in 

the objective (4.3) is trivially non-negative. Furthermore, the Q{W) regularizer is bounded 
as QiW) > ^ (cf. [70]). Therefore, the objective g{W,B,x) > 0. Since the sequence 
{g{W\B\x^ )} is monotone decreasing and lower bounded, it converges. ■ 

Lemma A.2. The iterate sequence |IT*, i?*,x*| generated by Algorithm Al is bounded, and 
it has at least one accumulation point. 

Proof. : The existence of a convergent subsequence (and hence, an accumulation point) 
for a bounded sequence is a standard result. Therefore, we only prove the boundedness of the 
iterate sequence. 

Since ||x *||2 < C V t trivially, we have that the sequence {x*} is bounded. We now show 
the boundedness of {VE*}. Let us denote the objective g{W^,B^,x^) as g^. It is obvious that 
the squared £2 norm terms and the barrier functions and x(x*) in the objective g^ (4.3), 

are non-negative. Therefore, we have 

(A.2) AQ(IT*) < 5 ' < <7° 

where the second inequality follows from Lemma A.l. Now, the function Q(hE*) = 
^”^ 1 ( 0 .5q;^ — log Qfj), where ai {1 < i < n) are the singular values of VE^ is a coercive 
function of the (non-negative) singular values, and therefore, it has bounded lower level sets 
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Combining this fact with (A.2), we can immediately conclude that 3 cq G M depending on 
and A, such that ||iy*||p < cq V t. 

Finally, the boundedness of {-B*} follows from the following arguments. First, for Algo¬ 
rithm A1 (see pseudocode), we have that B* = Hg Therefore, by the definition of 

we have 

(A.3) ||B‘||^ = ||F, < ||bF*||2||x*-i||p 

Since, by our previous arguments, and are both bounded by constants independent 
of t, we have that the sequence of sparse code matrices {B*}, is also bounded. ■ 

We now establish some key optimality properties of the accumulation points of the iterate 
sequence in Algorithm Al. 

Lemma A.3. All the accumulation points of the iterate sequence generated by Algorithm Al 
with a given initial correspond to a common objective value g*. Thus, they are 

equivalent in that sense. 

Proof. : Consider the subsequence B*?*, (indexed by qt) of the iterate sequence, 

that converges to the accumulation point (W*, B*, x*). Before proving the lemma, we establish 
some simple properties of {W*, B*,x*). First, equation (A.2) implies that — log |det | < 
{g^/X), for every t. This further implies that |det > e~^ > 0 V t. Therefore, due to the 

continuity of the function |det W\, the limit W* of the subsequence is also non-singular, with 
|detkF*| > Second, B'^‘ = Hg where is the matrix with Pjx'^*~^ 

(1 < j < N) as its columns. Thus, B"?* trivially satisfies '0(B'?*) = 0 for every t. Now, {B*?*} 
converges to B*, which makes B* the limit of a sequence of matrices, each of which has no 
more than s non-zeros. Thus, the limit B* obviously cannot have more than s non-zeros. 
Therefore, ||B*||g < s, or equivalently il>{B*) = 0. Finally, since satisfies the constraint 
\\xHl < we have x(a^'^*) = 0 V t. Additionally, since x*?* ^ x* as t —)• oo, we also have 
||x *||2 = lim 4 _>.oo ||x'^‘II 2 < C^. Therefore, x{x*) = 0. Now, it is obvious from the above 
arguments that 


(A.4) lim = x(3;*)) li™ V'(B'^*) = V^(B*) 

t—>-oo t—>-oo 

Moreover, due to the continuity of Q{W) at non-singular matrices W, we have 
= Q{W*). We now use these limits along with some simple properties of 
convergent sequences (e.g., the limit of the sum/product of convergent sequences is equal to 
the sum/product of their limits), to arrive at the following result. 


(A.5) 


lim 5 (W''‘,B''%x''‘) = 

t—>-oo 


lim 

t—>-oo 


Z^,=l 


Wi*Pjxi^ - bf 


N 

+ lim x(x^*) = V||kF*B,x* 


\u\\Ax<}^ -y\\l + XQ{W’i*) + 'il;{B‘}^)j 
-b*\\l + n ||Ax* - y\\l + XQ{W*) + ^P{B*) + x{x*) 


= g{W*,B*,x*) 


^®The lower level sets of a function f ■. A C R"' !-->■ R (where A is unbounded) are bounded if limt_>oo f{z*) = 
-1-00 whenever {2*} C A and linit_>oo ||2^|| = 00. 
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The above result together with the fact (from Lemma A.l) that = g* 

implies that g{W*,B*^x*) = g*. ■ 

The following lemma establishes partial global optimality with respect to the image, of 
every accumulation point. 

Lemma A.4. Any accumulation point iW*,B*,x*) of the iterate sequence generated by 
Algorithm Al satisfies 

(A.6) X* ^ arguiin g {W*, B* ,x) 


Proof. : Consider the subsequence , S'?*, (indexed by qt) of the iterate sequence, 
that converges to the accumulation point (W*,B*,x*). Then, due to the optimality of in 
the image update step of Algorithm Al, we have the following inequality for any t and any 
X E CP. 


(A.7) 


N 

i=i 



N 


+ v\\Ax-y\\l + x{x) 


b- 


2 

2 


Now, by (A.4), we have that limt_>.oo x(x'i‘) = x(x*). Taking the limit t —)• oo term by term 
on both sides of (A.7) for some fixed x E CP yields the following result 


N 

E 

1=1 


N 


\W*PjX* 




(A.8) 


+ x{x*)<^\\W 
1=1 

+ n \\Ax - y\\l + x{x) 




Since the choice of x in (A.8) is arbitrary, (A.8) holds for any x E CP. Recall that 
fi{B*) = 0 and Q{W*) is hnite based on the arguments in the proof of Lemma A.3. Therefore, 
(A.8) implies that g{W*,B*,x*) < g{W*,B*,x) V x E CP. This establishes the result (A.6) 
of the Lemma. ■ 

The following lemma will be used to establish that the change between successive image 
iterates ||x* — converges to 0. 

Lemma A.5. Consider the subsequence , B'^^, x^^} (indexed by qt) of the iterate se¬ 
quence, that converges to the accumulation point {W*, B*,x*). Then, the subsequence 
also converges to x*. 

Proof. : First, by Lemma A.3, we have that 


(A.9) g{W*,B*,x*)=g* 

Next, consider a convergent subsequence of (the bounded sequence) that 

converges to say x**. Now, applying the same arguments as in Equation (A.5) (in the proof 
of Lemma A.3), but with respect to the (convergent) subsequence }, we 

have that 


lim 5(1 TP’*sR''"‘,x‘'"‘"^) = g{W*,B*,x**) 


(A.IO) 
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Now, the monotonic decrease of the objective in Algorithm A1 implies 
(A.11) 

Taking the limit t —)• oo all through (A. 11) and using Lemma A.l (for the extreme left/right 
limits), and Equation A.10 (for the middle limit), immediately yields that g{W*, B*,x**) = g*. 
This result together with (A.9) implies g{W*, B*, x**) = g{W*, B*,x*). 

Now, we know by Lemma A.4 that 

(A. 12) X* G argmin g{W*,B*,x) 

X 

Furthermore, it was shown in Section 3.1.3 that if the set of patches in our formulation (PI) 
cover all pixels in the image (always true for the case of periodically positioned overlapping 
image patches), then the minimization of g{W*,B*,x) with respect to x has a unique so¬ 
lution. Therefore, x* is the unique minimizer in (A.12). Combining this with the fact that 
g{W*, B*,x**) = g{W*, B*, x*) yields that x** = x*. Since we worked with an arbitrary 
convergent subsequence (of in the above proof, we have that x* is the limit 

of any convergent subsequence of Finally, since every convergent subsequence of 

the bounded sequence converges to x*, it means that the (compact) set of accumu¬ 
lation points of coincides with x*. Since x* is the unique accumulation point of a 

bounded sequence, we therefore have [44] that the sequence itself converges to x*, 

which completes the proof. ■ 

Lemma A.6. The iterate sequence in Algorithm A1 satisfies 

(A.13) lim ||x* — = 0 

t—^OO ' ^ 

Proof. : Consider the sequence {a*} with a* = ||x* We will show below that 

every convergent subsequence of the bounded sequence {a*} converges to 0, thereby implying 
that 0 is both the limit inferior and limit superior of {a*}, which means that the sequence 
{o*} itself converges to 0. 

Now, consider a convergent subsequence of {a*}. Since the sequence 
is bounded, there exists a convergent subsequence converging to say 

(W*, B*, X*). By Lemma A.5, we then have that also converges to x*. Thus, 

the subsequence with a'?"-* = converges to 0. Since, itself is a 

subsequence of a convergent sequence, we must have that converges to the same limit 

(i.e., 0). We have thus shown that zero is the limit of any convergent subsequence of {a*}. H 

The next property is the partial global optimality of every accumulation point with respect 
to the sparse code. In order to establish this property, we need the following result. 

Lemma A.7. Consider a bounded matrix sequence with G that converges 

to Z* . Then, every accumulation point of [Hs{Z^)^ belongs to the set Hs{Z*). 

Proof. : The proof is very similar to that for Lemma 10 in [73]. ■ 

Lemma A.8. Any aceumulation point iW*,B*,x*) of the iterate sequence generated by 
Algorithm Al satisfies 

(A.14) B* G argmin g{W*,B,x*) 

B 
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Moreover, denoting by X* € matrix whose column is PjX* , for 1 < j < N, the 

above condition can be equivalently stated as 

(A.15) B* G Hs{W*X*) 


Proof. : Consider the subsequence S'?*, (indexed by qt) of the iterate sequence, 

that converges to the accumulation point {W*,B*,x*). Then, by Lemma A.5, con¬ 

verges to X*, and the following inequalities hold. 

(A.16) B* = lim = lim H, G Hs{W*X*) 

t^OO t^OO '' '' 

Since —)• W*X* as t —>■ oo, the last containment relationship above follows by 

Lemma A.7. While we have proved (A.15), the result in (A. 14) now immediately follows by 
applying the definition of Hs(-) from (A.l). ■ 

The next result pertains to the partial global optimality with respect to W, of every 
accumulation point in Algorithm Al. We use the following lemma (simple extension of Lemma 
1 in [73] to the complex field) to establish the result. 

Lemma A. 9. Consider a sequence {M^} with Mk G that converges to M. For each 

k, let VkT^kPk denote a full SVD of Mk- Then, every accumulation point {V,Ti,R) of the 
sequence {14,Sfc,i?fc} is such that VER^ is a full SVD of M. In particular, {S*,} converges 
to S, the n X n singular value matrix of M. 

Lemma A. 10. Any accumulation point {W*,B*,x*) of the iterate sequence generated by 
Algorithm Al satisfies 

(A.17) W* G argmin g{W,B*,x*) 

w 

Proof. : The proposed Algorithm Al involves M alternations between the transform 
update and sparse coding steps in every outer iteration. At a certain (outer) iteration t, let us 
denote the intermediate transform update outputs as , 1 < I < M. Then, IT* = 

We will only use the sequence jlTd’*) | |;] 2 e intermediate outputs for Z = 1) in the following 

proof. 

Consider the subsequence {W^‘, S'?*, x*?*} (indexed by qt) of the iterate sequence in 
Algorithm Al, that converges to the accumulation point (W*, B*,x*). Then, lyd^st+i) 
is computed as follows, using the full SVD (i?*?*)^, of (L'J‘)~^ (il'?*)^, where 

L?* = (V9‘)^ + 0.5A/y^^ 

(A.18) ^{L<?t+i) = Q + (^(S«‘)^ + 2A/) (V‘'‘)^ (L'?*)"^ 


To prove the lemma, we will consider the limit t —>■ oo in (A.18). In order to take this limit, we 
need the following results. First, due to the continuity of the matrix square root and matrix 
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inverse functions at positive definite matrices, we have that the following limit holds, where 
X* G has PjX* {I < j < N) as its columns. 

^lim = ^hm + 0.5A/) = (^X* (X*)^ + 0.5A/) 

1 /2 

Next, defining L* = ^X* (X*)^ + 0.5A/^ , we also have that 

(A.19) lim X* {B*)^ 

t^OO 

Applying Lemma A.9 to (A.19), we have that every accumulation point (V*,T,*, R*) of the 
sequence is such that {R*)^ is a full SVD of (L*)“^ X* (B*)^. Now, con¬ 
sider a convergent subsequence {!/'?"*, S'?"*, of {!/'?% S'?*, with limit {V*,Tj*, R*). 

Then, taking the limit t —?> oo in (A. 18) along this subsequence, we have 

IT** = ^lim = 0.5R* ^S* -h (^(S*)^ 2A/) (T*)^ (L*)"^ 

Combining this result with the aforementioned definitions of the square root L* and the full 
SVD V*S* {R*)^, and applying Proposition 3.1, we get 

(A.20) IT** G argmin g{W,B*,x*) 

w 

Finally, applying the same arguments as in the proof of Lemma A.3 to the subsequence 
js*?"*, a:'?"*, IT^^’'?"*’''^) |, we easily get that g* = g{W**, B*, x*). Since, by Lemma A.3, we 
also have that g* = g{W*,B*,x*), we get g(W*,B*,x*) = g{W**, B*,x*), which together 
with (A.20) immediately establishes the required result (A.17). ■ 

The following lemma establishes that every accumulation point of the iterate sequence 
in Algorithm A1 is a critical point of the objective g{W,B,x). All derivatives or sub¬ 
differentials are computed with respect to the real and imaginary parts of the corresponding 
variables/vectors/matrices below. 

Lemma A. 11. Every accumulation point {W*,B*,x*) of the iterate sequence generated by 
Algorithm Al is a critical point of the objective g{W,B,x) satisfiying 

(A.21) 0£dg{W*,B\x*) 


Proof. : Consider the subsequence {VT?*, S'?*, x*?*} (indexed by qt) of the iterate sequence, 
that converges to the accumulation point {W*, B*,x*). By Lemmas A.10, A.4, and A.8, we 


have that 


(A.22) 

IT* G argmin g{W,B*,x*) 


w 

(A.23) 

X* G argmin g{W*,B*,x 

X 

(A.24) 

B* G arg min g (IT*, B, x* 


B 
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The function giyV, B,x) is continuously differentiable with respect to W at non-singular points 
W. Since W* is a (non-singular) partial global minimizer in (A.22), we have as a necessary 
condition that Vwd (W*, B*,x*) = 0. Next, recall the statement in Section 4.1, that a 
necessary condition for z G to be a minimizer of some function (/> : i-G (—oo,-|-oo] is 

that z satisfies 0 G d(j){z). Now, since x* and B* are partial global minimizers in (A.23) and 
(A.24), respectively, we therefore have that 0 G dgx {W*,B*,x*) and 0 G dgs {W*,B*,x*), 
respectively. It is also easy to derive these conditions directly using the definition (Definition 
4.2) of the sub-differential. 

Finally, (see Proposition 3 in [3]) the subdifferential dg at {W*,B*,x*) satishes 

(A.25) dg {W*,B*,x*) = Vw9 (W*, B*,x*) x dgs {W*,B*,x*) x dg^ {W*, B*,x*) 

Now, using the preceding results, we easily have that 0 G dg (W*, B*,x*) above. Thus, every 
accumulation point in Algorithm A1 is a critical point of the objective. ■ 

The following two lemmas establish pairwise partial local optimality of the accumulation 
points in Algorithm Al. Here, X* G is the matrix with PjX* as its columns. 

Lemma A.12. Every accumulation point {W*,B*,x*) of the iterate sequence generated by 
Algorithm Al is a partial minimizer of the objective g{W, B, x) with respect to iW, B), in the 
sense of (4.9), for sufficiently small dW G and all AB G in the union of the 

regions R1 and R2 in Theorem 4-3. Furthermore, i/||VF*A*||g < s, then the AB in (4.9) can 
be arbitrary. 

Proof. : Consider the subsequence {W^^, S'?*, (indexed by qt) of the iterate sequence, 
that converges to the accumulation point {W*,B*,x*). First, by Lemmas A.8 and A.10, we 
have 

(A.26) B* G Hs{W*X*) 

(A.27) 2W*X* {X*)^ - 2B* (X*)^ + AIT* - A = 0 

where (A.27) follows from the first order conditions for partial global optimality of W* in 
(A.17). The accumulation point {W*,B*,x*) also satisfies 'f{B*) = 0 and x{x*) = 0. 

Now, considering perturbations dW G and AB G , we have that 

g{W* + dW, B* + AB, x*) = ||1T*A* - B* + {dW)X* - AB\\], 

(A.28) + 1 / ||Ax* - y\\l + 0.5A ||IF* + dW\\], - A log |det {W* + dW)\ + f{B* + AB) 

In order to prove the condition (4.9) in Theorem 4.3, it suffices to consider sparsity preserving 
perturbations AB, that is AB G such that B* -|- AB has sparsity < s. Otherwise 

g(W* + dW,B* + AB,x*) = -Loo > g{W*, B*,x*) trivially. Therefore, we only consider 
sparsity preserving AB in the following, for which f{B* + AB) = 0 in (A.28). 

Since the image x* is fixed here, we can utilize (A.26) and (A.27), and apply similar 
arguments as in the proof of Lemma 9 (equations (43)-(46)) in [73], to simplify the right hand 
side in (A.28). The only difference is that the matrix transpose (•)^ operations in [73] are 
replaced with Hermitian (•)^ operations here, and the operation {Q, R) involving two matrices 
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(or, vectors) Q and R in [73] is redefined here as 
(A.29) (Q, R) = Re {tr(Qi?^)} 

Upon such simplifications, we can conclude [73] that 3e' > 0 depending on W* such that 
whenever ||(iPU||^ < e', we have 

(A.30) g{W* + dW, B* + AB, x*) > g{W\ B*,x*) - 2 {W*X* - B*, AB) 

Consider first the case of AB in region Rl. Then, the term — {W*X* — B*,AB) above is 
trivially non-negative for such AB in Theorem 4.3, and therefore, g{W* + dW, B* -\-AB, x*) > 
g{W*,B*,x*) for AB G Rl. 

Next, consider the case of AB in region R2. Then, when ||1 T*X*||q > s, it is easy to 
see that any such sparsity preserving AB in region R2 in Theorem 4.3 will have its sup¬ 
port (i.e., non-zero locations) contained in the support of B* G Hs{W*X*). Therefore, 
(W*X* — B*,AB) = 0 in this case. On the other hand, if ||1 T*A*||q < s, then by (A.26), 
W*X* — B* = 0. Therefore, by these arguments, {W*X* — B*,AB) = 0 for any sparsity 
preserving AB G R2. This result together with (A.30) implies g(W* -|- dW,B* -|- AB,x*) > 
g{W*,B*,x*) for any AB G R2. 

Finally, if ||lT*X*||g < s, then since W*X* — B* = 0 in (A.30), therefore in this case, AB 
in (4.9) can be arbitrary. ■ 

Lemma A.13. Every accumulation point {W*,B*,x*) of the iterate sequence generated by 
Algorithm A1 is a partial minimizer of the objective g{W,B,x) with respect to {B,x), in the 
sense of (4.10), for all Ax G C^, and all AB G in the union of the regions Rl and R2 

in Theorem f.3. Furthermore, i/||lT*Ai *||q < s, then the AB in (4.10) can be arbitrary. 

Proof. : Consider the subsequence (indexed by qt) of the iterate sequence, 

that converges to the accumulation point {W*,B*,x*). It follows from Lemmas A.8 and A.4 
that if{B*) = x{x*) = 0. Now, considering perturbations AB G whose columns are 

denoted as Abj {1 < j < N), and Ax G C^, we have that 

N 

g{W*,B* + AB, X* + Ax) = ^ 

i=i 

(A.31) + A Q{W*) + n || Ax* -y + AAx\\l + if{B* + AB) + x{x* + Ax) 

In order to prove the condition (4.10) in Theorem 4.3, it suffices to consider sparsity preserving 
perturbations AB, that is AB G such that B* -|- AB has sparsity < s. It also suffices 

to consider energy preserving perturbations Ax, which are such that ||x* -|-Axjl^ < C. For 
any other AB or Ax, g{W*,B* + AB,x* + Ax) = -Loo > g{W*,B*,x*) trivially. Therefore, 
we only consider the energy/sparsity preserving perturbations in the following, for which 
ip{B* -L AB) = 0 and x{x* + Ax) = 0 in (A.31). Now, upon expanding the squared £2 terms 

include the Re{-) operation in the definition here, which allows for simpler notations in the rest of the 
proof. However, the (Q,R) defined in (A.29) is no longer the conventional inner product. 


\W*PiX* -b* + W*RAx - AE 
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in (A.31), and dropping non-negative perturbation terms, we get 

N 

g{W*,B*+AB,x* + Ax) > g{W*,B*,x*) -2^{W*PjX* - b*,Abj) 

i=i 

N 

(A.32) + 2 X] W*PjAx^ + 2v (Ax* - y, AAx"j 

i=i 

where the (•,•) notation is as dehned in (A.29). Now, using arguments identical to those 
used with (A.30), it is clear that the term —2 Ylf=i (w*PjX* — b*, Abj'j > 0 in (A.32) for all 
sparsity preserving AB E R1 U R2. Furthermore, since by Lemma A.4, x* is a global minimizer 
of g{W*, B*, x), it must satisfy the Normal equation (3.12) for some (unique) // > 0. Using 
these arguments, (A.32) simplifies to 

(A.33) g{W*,B*+AB, x* + Ax) > g{W*,B*,x*) - 2/r (^x*, Ax^ 

Now, if (the optimal) y = 0 above, then g{W*,B* + AB,x* + Ax) > g{W*,B*,x*) for 
arbitrary Ax E C^, and AB E R1 U R2. The alternative scenario is the case when (the 
optimal) /r > 0, which can occur only if ||x *||2 = C holds. Since we are considering 

energy preserving perturbations Ax, we have ||x* + Ax ||2 < which implies 

-2(^x*, Ax^ > IIAxII^ > 0. Combining this result with (A.33), we again have (now for the 

/i > 0 case) that g{W*,B* + AB,x* + Ax) > g{W*,B*,x*) for arbitrary Ax E C^, and 
AB E R1UR2. ■ 

A.2. Proofs of Theorems 4.6 and 4.7. The proofs of Theorems 4.6 and 4.7 are very 
similar to that for Theorem 4.3. We only discuss some of the minor differences, as follows. 

First, in the case of Theorem 4.6, the main difference in the proof is that the non-negative 
barrier function 'iIj{B) and the operator 77s(-) (in the proof of Theorem 4.3) are replaced by 
the non-negative penalty ll^jllo operator H)lj{-), respectively. Moreover, the 

mapping 77s(') defined in (A.l) for the proof of Theorem 4.3, is replaced by the matrix-to-set 
mapping Hn{') (for the proof of Theorem 4.6) defined as 

Zij I < T] 

Zij I = T] 

Zij I > 7] 

By thus replacing the relevant functions and operators, the various steps in the proof of 
Theorem 4.3 can be easily extended to the case of Theorem 4.6. As such. Theorem 4.6 mainly 
differs from Theorem 4.3 in terms of the definition of the set of allowed (local) perturbations 
AB. In particular, the proofs of partial local optimality of accumulation points in Theorem 
4.6 are easily extended from the aforementioned proofs for Lemmas A.12 and A.13, by using 
the techniques and inequalities presented in Appendix F of [73]. 

Finally, the main difference between the proofs of Theorems 4.3 and 4.7 is that the non¬ 
negative A QiW) penalty in the former is replaced by the barrier function y^{W) (that enforces 


(A.34) 


77^(Z) ] = <( 0} 

Zig 
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the unitary property, and keeps always bounded) in the latter. Otherwise, the proof 
techniques are very similar for the two cases. 
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